Time series

From Python for Data Analysis:

Time series data is an important form of structured data in many different fields, such as finance, economics, ecology, neuroscience, and physics. Anything that is observed or measured at many points in time forms a time series. Many time series are fixed frequency, which is to say that data points occur at regular intervals according to some rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can also be irregular, without a fixed unit of time or offset between units. How you mark and refer to time series data depends on the application, and you may have one of the following:

  • timestamps, specific instants in time
  • fixed periods, such as the month January 2007 or the full year 2010
  • intervals of time, indicated by a start and end timestamp. Periods can be thought of as special cases of intervals
  • Experiment or elapsed time; each timestamp is a measure of time relative to a particular start time. For example, the diameter of a cookie baking each second since being placed in the oven

Pandas provides a standard set of time series tools and data algorithms. With these you can efficiently work with very large time series and easily slice and dice, aggregate, and resample irregular and fixed frequency time series. As you might guess, many of these tools are especially useful for financial and economics applications, but you could certainly use them to analyze server log data, too.

In [ ]:
from __future__ import division
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 4))

In [ ]:
%matplotlib inline

Date and Time Data Types and Tools

In general, dealing with date arithmetic is hard. Luckily, Python has a robust library that implements datetime objects, which handle all of the annoying bits of date manipulation in a powerful way.

In [ ]:
from datetime import datetime
now = datetime.now()

Every datetime object has a year, month, and day field.

In [ ]:
now.year, now.month, now.day

You can do arithmetic on datetime objects. Subtracting one datetime from another produces a timedelta object.

In [ ]:
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

timedelta objects are very similar to datetime objects, with similar fields:
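
A minimal sketch of those fields, recomputing the delta from the cell above. A timedelta stores its value as days, seconds, and microseconds; the names days_part and seconds_part are only illustrative.

```python
from datetime import datetime

# Same subtraction as in the cell above
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# Whole days, plus the remainder expressed in seconds
days_part = delta.days        # 926
seconds_part = delta.seconds  # 56700, i.e. 15 hours 45 minutes
```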

As you would expect, arithmetic between datetime and timedelta objects produces datetime objects.

In [ ]:
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)

In [ ]:
start - 2 * timedelta(12)

Converting between string and datetime

In general, it is easier to format a string from a datetime object than to parse a string date into a datetime object.

In [ ]:
stamp = datetime(2011, 1, 3)

To format a string from a datetime object, use the strftime method with the standard format codes, such as %Y (four-digit year), %m (two-digit month), and %d (two-digit day).
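
As a small illustration of formatting, the sketch below converts the stamp defined above to strings. The variable names are only illustrative.

```python
from datetime import datetime

stamp = datetime(2011, 1, 3)

# str() gives the default ISO-like representation
as_default = str(stamp)                   # '2011-01-03 00:00:00'

# strftime takes an explicit format built from % codes
as_iso_date = stamp.strftime('%Y-%m-%d')  # '2011-01-03'
as_us_date = stamp.strftime('%m/%d/%Y')   # '01/03/2011'
```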

To parse a string into a datetime object, you can use the strptime method, along with the relevant format.

In [ ]:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')

Of course, this being Python, we can easily apply the same parsing to a whole list of strings using a comprehension.

In [ ]:
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

Without question, datetime.strptime is the best way to parse a date, especially when you know the format a priori. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, you can use the parser.parse method in the third party dateutil package:

In [ ]:
from dateutil.parser import parse

dateutil is capable of parsing almost any human-intelligible date representation:

In [ ]:
parse('Jan 31, 1997 10:45 PM')

In international locales, day appearing before month is very common, so you can pass dayfirst=True to indicate this:

In [ ]:
parse('6/12/2011', dayfirst=True)

Pandas is generally oriented toward working with arrays of dates, whether used as an index or a column in a DataFrame. The to_datetime method parses many different kinds of date representations. Standard date formats like ISO8601 can be parsed very quickly.
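
A minimal sketch of to_datetime applied to the list of date strings defined earlier. It assumes the strings are month-first, which is how Pandas reads this format by default.

```python
import pandas as pd

datestrs = ['7/6/2011', '8/6/2011']

# Parse the whole list at once; the result is a DatetimeIndex
idx = pd.to_datetime(datestrs)

first = idx[0]  # a Pandas Timestamp for 6 July 2011
```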

Notice that the Pandas object at work behind the scenes here is the DatetimeIndex, which is a subclass of Index. More on this later. to_datetime also handles values that should be considered missing (None, empty string, etc.):

In [ ]:
idx = pd.to_datetime(datestrs + [None])

datetime objects also have a number of locale-specific formatting options for systems in other countries or languages. For example, the abbreviated month names will be different on German or French systems compared with English systems.

Time Series Basics

The most basic kind of time series object in Pandas is a Series indexed by timestamps, which are often represented outside of Pandas as Python strings or datetime objects.

In [ ]:
from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = Series(np.random.randn(6), index=dates)

Under the hood, these datetime objects have been put in a DatetimeIndex, and the variable ts is a Series (older Pandas versions reported a separate TimeSeries type).

In [ ]:
type(ts)
# note: output changed to "pandas.core.series.Series"

Like other Series, arithmetic operations between differently-indexed time series automatically align on the dates:

In [ ]:
ts + ts[::2]

Pandas stores timestamps using NumPy's datetime64 date type at the nanosecond resolution:

In [ ]:
ts.index.dtype
# note: output changed from dtype('datetime64[ns]') to dtype('<M8[ns]')

Scalar values from a DatetimeIndex are Pandas Timestamp objects:

In [ ]:
stamp = ts.index[0]
stamp
# note: output changed from <Timestamp: 2011-01-02 00:00:00> to Timestamp('2011-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations. More on both of these things later.

Indexing, selection, subsetting

A time series is an ordinary Series and thus behaves in the same way with regard to indexing and selecting data based on label:

In [ ]:
stamp = ts.index[2]
ts[stamp]

As a convenience, you can also pass a string that is interpretable as a date:
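
A minimal sketch of string-based lookup. It rebuilds the six-point series with known values (0.0 through 5.0, an assumption made so the result is predictable) and uses .loc for label-based access.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6.0), index=dates)

# Both strings are interpreted as 10 January 2011
by_slash = ts.loc['1/10/2011']
by_compact = ts.loc['20110110']
```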

For longer time series, a year or only a year and month can be passed to easily select slices of data:

In [ ]:
longer_ts = Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
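
As a sketch of partial-string selection on the longer series, the year string below selects all of 2001 and the year-month string selects May 2001. The use of .loc here is a choice for explicit label-based access.

```python
import numpy as np
import pandas as pd

longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))

# A year alone selects every observation in that year
year_2001 = longer_ts.loc['2001']

# A year and month selects every observation in that month
may_2001 = longer_ts.loc['2001-05']
```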

Slicing with dates works just like with a regular Series:

In [ ]:
ts[datetime(2011, 1, 7):]

Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:
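
A minimal sketch of such a range query. Neither endpoint below appears in the index, yet the slice returns every observation that falls between them. The series values are placeholders.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6.0), index=dates)

# 6 January and 11 January are not in the index
in_range = ts.loc['1/6/2011':'1/11/2011']
```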

As before, you can pass a string date, datetime, or Timestamp. Remember that slicing in this manner produces views on the source time series, just like slicing NumPy arrays. There is an equivalent instance method, truncate, which slices a Series between two dates:
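
A minimal sketch of truncate on the six-point series, with placeholder values. The before and after arguments are both optional and both inclusive.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6.0), index=dates)

# Drop everything after 9 January
head = ts.truncate(after='1/9/2011')

# Keep only the observations between two dates
middle = ts.truncate(before='1/6/2011', after='1/11/2011')
```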

All of the above holds true for DataFrame as well, indexing on its rows:

In [ ]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = DataFrame(np.random.randn(100, 4),
                    index=dates,
                    columns=['Colorado', 'Texas', 'New York', 'Ohio'])
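
As a sketch of row selection on a DataFrame, the example below builds the frame with the weekly dates as its index and selects one month of rows. The string '2001-05' and the use of .loc are choices made for this illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# Select every row (every Wednesday) in May 2001
may_2001 = long_df.loc['2001-05']
```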

Time series with duplicate indices

In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:

In [ ]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                          '1/3/2000'])
dup_ts = Series(np.arange(5), index=dates)

We can tell that the index is not unique by checking its is_unique property:

Indexing into this time series will now either produce scalar values or slices depending on whether a timestamp is duplicated:

In [ ]:
dup_ts['1/3/2000']  # not duplicated

In [ ]:
dup_ts['1/2/2000']  # duplicated

Suppose you want to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0 (the only level of indexing!):

In [ ]:
grouped = dup_ts.groupby(level=0)
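
A minimal sketch of checking for duplicates and aggregating them. The fifth date, 1/3/2000, is inferred from the lookups shown above.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                          '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)

# False, because 1/2/2000 appears three times
is_unique = dup_ts.index.is_unique

# Group on the (only) index level and aggregate each timestamp
grouped = dup_ts.groupby(level=0)
means = grouped.mean()    # 0.0, 2.0, 4.0
counts = grouped.count()  # 1, 3, 1
```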

Date ranges, Frequencies, and Shifting

Generic time series in Pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it's often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into a time series. Fortunately, Pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed frequency date ranges. For example, the sample time series can be converted to a fixed daily frequency by calling resample:
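
A minimal sketch of that conversion, with placeholder values. In current Pandas versions resample returns a deferred Resampler object, so the sketch calls asfreq on it to get the daily series; days with no observation become NaN.

```python
from datetime import datetime

import numpy as np
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.arange(6.0), index=dates)

# Reindex onto every calendar day between the first and last observation
daily = ts.resample('D').asfreq()

n_days = len(daily)                  # 11 days, 2 Jan through 12 Jan
n_observed = int(daily.notna().sum())  # the 6 original observations
```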

Conversion between frequencies or resampling is a big enough topic to have its own section later. Here, we'll see how to use the base frequencies and multiples thereof.

Generating date ranges

You may have guessed that pandas.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency:

In [ ]:
index = pd.date_range('4/1/2012', '6/1/2012')

By default, date_range generates daily timestamps. If you pass only a start or end date, you must pass a number of periods to generate:

In [ ]:
pd.date_range(start='4/1/2012', periods=20)

In [ ]:
pd.date_range(end='6/1/2012', periods=20)

The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the 'BM' frequency (business end of month) and only dates falling on or inside the date interval will be included:

In [ ]:
pd.date_range('1/1/2000', '12/1/2000', freq='BM')

date_range by default preserves the time (if any) of the start or end timestamp:

In [ ]:
pd.date_range('5/2/2012 12:56:31', periods=5)

Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:

In [ ]:
pd.date_range('5/2/2012 12:56:31', periods=5, normalize=True)

Frequencies and Date Offsets

Frequencies in Pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is an object defined, generally referred to as a date offset. For example, hourly frequency can be represented with the Hour class:

In [ ]:
from pandas.tseries.offsets import Hour, Minute
hour = Hour()

You can define a multiple of an offset by passing an integer:

In [ ]:
four_hours = Hour(4)

In most applications, you would never need to explicitly create one of these objects, instead using a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:

In [ ]:
pd.date_range('1/1/2000', '1/3/2000 23:59', freq='4h')

Many offsets can be combined together by addition:

In [ ]:
Hour(2) + Minute(30)

Similarly, you can pass frequency strings like '2h30min' which will effectively be parsed to the same expression.

In [ ]:
pd.date_range('1/1/2000', periods=10, freq='1h30min')

Some frequencies describe points in time that are not evenly spaced. For example, 'M' (calendar month end) and 'BM' (last business/weekday of month) depend on the number of days in a month and, in the latter case, whether the month ends on a weekend or not. For lack of a better term, we will call these anchored offsets.

Week of month dates

One useful frequency class is "week of month", starting with WOM. This enables you to get dates like the third Friday of each month:

In [ ]:
rng = pd.date_range('1/1/2012', '9/1/2012', freq='WOM-3FRI')

Traders of US equity options will recognize these dates as the standard dates of monthly expiry.

Shifting (leading and lagging) data

"Shifting" refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified:

In [ ]:
ts = Series(np.random.randn(4),
            index=pd.date_range('1/1/2000', periods=4, freq='M'))

A common use of shift is computing percent changes in a time series or multiple time series as DataFrame columns. This is expressed as

ts / ts.shift(1) - 1
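
The naive shift and the percent-change expression can be sketched on a small series with known values. The values 1, 2, 4, 8 are an assumption chosen so each step doubles, and the month-end frequency is passed as an offset object rather than a string alias.

```python
import numpy as np
import pandas as pd

ts = pd.Series([1.0, 2.0, 4.0, 8.0],
               index=pd.date_range('1/1/2000', periods=4,
                                   freq=pd.offsets.MonthEnd()))

# Move the data one step forward; the index stays the same and the
# first slot becomes NaN
lagged = ts.shift(1)

# Move the data one step backward; the last slot becomes NaN
led = ts.shift(-1)

# Period-over-period percent change
pct_change = ts / ts.shift(1) - 1
```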

Because naive shifts leave the index unmodified, some data is discarded. Thus, if the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:

In [ ]:
ts.shift(2, freq='M')

Other frequencies can be passed, too, giving you a lot of flexibility in how to lead and lag the data:

In [ ]:
ts.shift(3, freq='D')

In [ ]:
ts.shift(1, freq='3D')

In [ ]:
ts.shift(1, freq='90T')

Shifting dates with offsets

The Pandas date offsets can also be used with datetime or Timestamp objects:

In [ ]:
from pandas.tseries.offsets import Day, MonthEnd
now = datetime(2011, 11, 17)
now + 3 * Day()

If you add an anchored offset like MonthEnd, the first increment will roll forward a date to the next date according to the frequency rule:

In [ ]:
now + MonthEnd()

In [ ]:
now + MonthEnd(2)

Anchored offsets can explicitly "roll" dates forward or backward using their rollforward and rollback methods, respectively:

In [ ]:
offset = MonthEnd()
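
A minimal sketch of the two methods, applied to the same date used above. The result names are only illustrative.

```python
from datetime import datetime

from pandas.tseries.offsets import MonthEnd

now = datetime(2011, 11, 17)
offset = MonthEnd()

# Next month end on or after the date
rolled_forward = offset.rollforward(now)  # Timestamp('2011-11-30 00:00:00')

# Previous month end on or before the date
rolled_back = offset.rollback(now)        # Timestamp('2011-10-31 00:00:00')
```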

A clever use of date offsets is to use these methods with groupby:

In [ ]:
ts = Series(np.random.randn(20),
            index=pd.date_range('1/15/2000', periods=20, freq='4d'))
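
As a sketch of that idea, passing offset.rollforward to groupby maps every timestamp to its month end, so each group holds one calendar month. The values 0.0 through 19.0 are an assumption made so the group means are easy to check.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

ts = pd.Series(np.arange(20.0),
               index=pd.date_range('1/15/2000', periods=20, freq='4D'))

offset = MonthEnd()

# groupby calls rollforward on each index label to build the group keys
monthly_mean = ts.groupby(offset.rollforward).mean()
```

The result has one row per month end: 31 January, 29 February, and 31 March 2000.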

Of course, an easier and faster way to do this is using resample (more on this to come).

In [ ]:
ts.resample('M').mean()

Time Zone Handling

Working with time zones is a pain. As Americans hold on dearly to daylight saving time, we must pay the price with difficult conversions between time zones. Many time series users choose to work in coordinated universal time (UTC), relative to which time zones can be expressed as offsets.

In Python we can use the pytz library, which is based on the Olson database of world time zone data.

In [ ]:
import pytz

To get a time zone object from pytz, use pytz.timezone.

In [ ]:
tz = pytz.timezone('US/Eastern')

Methods in Pandas will accept either time zone names or these objects. Using the names is recommended.

Localization and Conversion

By default, time series in Pandas are time zone naive. Consider the following time series:

In [ ]:
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = Series(np.random.randn(len(rng)), index=rng)

The index's tz field is None:

Date ranges can be generated with a time zone set:

In [ ]:
pd.date_range('3/9/2012 9:30', periods=10, freq='D', tz='UTC')

Conversion from naive to localized is handled by the tz_localize method:

In [ ]:
ts_utc = ts.tz_localize('UTC')

Once a time series has been localized to a particular time zone, it can be converted to another time zone using tz_convert.

In [ ]:
ts_utc.tz_convert('US/Eastern')

In the case of the above time series, which straddles a DST transition in the US/Eastern time zone, we could localize to US/Eastern and convert to, say, UTC or Berlin time.

In [ ]:
ts_eastern = ts.tz_localize('US/Eastern')
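
A minimal sketch of those two conversions, with placeholder values. It assumes the zone name 'America/New_York', the canonical name for the zone that US/Eastern refers to. The UTC hour changes partway through the series because of the DST transition on 11 March 2012.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = pd.Series(np.arange(6.0), index=rng)

# Interpret the naive 9:30 timestamps as New York wall-clock time
ts_eastern = ts.tz_localize('America/New_York')

# Convert the same instants to other zones
ts_utc = ts_eastern.tz_convert('UTC')
ts_berlin = ts_eastern.tz_convert('Europe/Berlin')

# 9:30 in New York is 14:30 UTC before the transition and 13:30 UTC after
utc_hours = [stamp.hour for stamp in ts_utc.index]
```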

tz_localize and tz_convert are also instance methods on DatetimeIndex.

Operations with time zone-aware Timestamp objects

Like time series and date ranges, individual Timestamp objects can be localized from naive to time zone-aware and converted from one time zone to another:

In [ ]:
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')

You can also pass a time zone when creating the Timestamp.

In [ ]:
stamp_moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')

Time zone-aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the UNIX epoch (January 1, 1970); this UTC value is invariant between time zone conversions:
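
A minimal sketch of that invariance, using the value attribute, which holds the nanoseconds since the epoch. The zone name 'America/New_York' is an assumption chosen for the example.

```python
import pandas as pd

stamp_utc = pd.Timestamp('2011-03-12 04:00').tz_localize('UTC')

# Same instant, displayed in a different zone
stamp_eastern = stamp_utc.tz_convert('America/New_York')

# The wall-clock fields differ, but the underlying UTC value does not
same_instant = stamp_utc.value == stamp_eastern.value  # True
```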

When performing time arithmetic using Pandas' DateOffset objects, daylight saving time transitions are respected where possible:

In [ ]:
# 30 minutes before DST transition
from pandas.tseries.offsets import Hour
stamp = pd.Timestamp('2012-03-11 01:30', tz='US/Eastern')

In [ ]:
stamp + Hour()

In [ ]:
# 90 minutes before DST transition
stamp = pd.Timestamp('2012-11-04 00:30', tz='US/Eastern')

In [ ]:
stamp + 2 * Hour()

Operations between different time zones

If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion to happen.

In [ ]:
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = Series(np.random.randn(len(rng)), index=rng)

In [ ]:
ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
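
As a sketch of what the combined result looks like, the example below repeats the operation with placeholder values and inspects the index. It uses iloc for the positional slices.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = pd.Series(np.arange(10.0), index=rng)

ts1 = ts.iloc[:7].tz_localize('Europe/London')
ts2 = ts1.iloc[2:].tz_convert('Europe/Moscow')

# The two indexes are aligned on the underlying instants
result = ts1 + ts2

result_tz = str(result.index.tz)  # 'UTC'
```

The first two entries of result are NaN because ts2 has no observations for those instants.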